City of Boulder Accounts Payable Exploratory Data Analysis¶

Datasource : https://open-data.bouldercolorado.gov/datasets/10c2e57f741b45428c492f67aeb98b1b_0/explore

HomePage : https://johnkimaiyo.vercel.app/

Python Projects : https://johnkimaiyo.vercel.app/Pages/Python%20Projects/Python.html

In [1]:
# Load in some packages
import calendar
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


# load dataset
boulder_df =pd.read_csv(r"C:\Users\jki\Downloads\Accounts_Payable (1).csv")  
boulder_df.head(6)
Out[1]:
vendor_name transaction_date transaction_amount organization_code fund fund_desc department department_desc object_code account_description ObjectId
0 PEASE, JUDITH L 2024/01/16 246.45 18550010 1100 General 185 Finance 670395 Misc Purchased Services 1
1 PUSH PEDAL PULL, INC 2024/01/15 470.00 51523032 2300 Recreation Activity 515 Parks and Recreation 670395 Misc Purchased Services 2
2 ECOARTS CONNECTIONS INC 2024/01/16 700.00 95111004 1100 General 951 Fundwide / Citywide 670395 Misc Purchased Services 3
3 XCEL ENERGY 2024/01/16 12.69 61015020 6100 Water Utility 610 Public Works - Utilities 680210 Gas & Electric 4
4 PUSH PEDAL PULL, INC 2024/01/15 691.64 51523031 2300 Recreation Activity 515 Parks and Recreation 630250 Materials: Equipment 5
5 GREATER WESTERN FENCE LLC 2024/01/15 2770.00 55520180 2500 Open Space 555 Open Space & Mountain Parks 660800 R&M: Land Improvements 6

After getting a sense of the data's structure, it is a good idea to look at a statistical summary of the variables with df.describe()

In [3]:
boulder_df.describe()
Out[3]:
transaction_amount organization_code fund department object_code ObjectId
count 3.815620e+05 3.815620e+05 381562.000000 381562.000000 381562.000000 381562.000000
mean 6.838728e+03 4.093152e+07 2992.008067 408.574481 649140.119310 190781.500000
std 1.101914e+05 1.740279e+07 2209.419437 173.147329 75219.369195 110147.606043
min -8.122976e+05 1.000000e+04 1.000000 0.000000 105200.000000 1.000000
25% 5.915250e+01 3.101502e+07 1100.000000 310.000000 630310.000000 95391.250000
50% 2.409200e+02 4.101005e+07 2200.000000 410.000000 660815.000000 190781.500000
75% 1.028215e+03 5.552016e+07 6100.000000 555.000000 670395.000000 286171.750000
max 3.682732e+07 9.716700e+07 8120.000000 951.000000 892101.000000 381562.000000
In [4]:
# lets  check for missing values
missing_values  = boulder_df.isna().sum()
print(missing_values)
vendor_name               0
transaction_date          0
transaction_amount        0
organization_code         0
fund                      0
fund_desc                 0
department                0
department_desc        8803
object_code               0
account_description       0
ObjectId                  0
dtype: int64
In [22]:
# lets check the data types
boulder_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381562 entries, 0 to 381561
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   vendor_name          381562 non-null  object 
 1   transaction_date     381562 non-null  object 
 2   transaction_amount   381562 non-null  float64
 3   organization_code    381562 non-null  int64  
 4   fund                 381562 non-null  int64  
 5   fund_desc            381562 non-null  object 
 6   department           381562 non-null  int64  
 7   department_desc      372759 non-null  object 
 8   object_code          381562 non-null  int64  
 9   account_description  381562 non-null  object 
 10  ObjectId             381562 non-null  int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 32.0+ MB
In [5]:
# lets remove missing values
boulder_df.dropna(subset=['department_desc'],inplace =True)
# lets confirm existance of missing values
missing_values =boulder_df.isna().sum()
print(missing_values)
vendor_name            0
transaction_date       0
transaction_amount     0
organization_code      0
fund                   0
fund_desc              0
department             0
department_desc        0
object_code            0
account_description    0
ObjectId               0
dtype: int64
In [36]:
# lets change the date data type
boulder_df['Date'] = pd.to_datetime(boulder_df['transaction_date'])
boulder_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 372759 entries, 0 to 381559
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   vendor_name          372759 non-null  object        
 1   transaction_date     372759 non-null  object        
 2   transaction_amount   372759 non-null  int32         
 3   organization_code    372759 non-null  int64         
 4   fund                 372759 non-null  int64         
 5   fund_desc            372759 non-null  object        
 6   department           372759 non-null  int64         
 7   department_desc      372759 non-null  object        
 8   object_code          372759 non-null  int64         
 9   account_description  372759 non-null  object        
 10  ObjectId             372759 non-null  int64         
 11  Date                 372759 non-null  datetime64[ns]
 12  Month                372759 non-null  int32         
 13  Month Name           372759 non-null  object        
dtypes: datetime64[ns](1), int32(2), int64(5), object(6)
memory usage: 39.8+ MB
In [7]:
# Let's create a new variable, Month, from 'Order Date':

boulder_df['Month'] = boulder_df['Date'].dt.month
boulder_df['Month'].describe()
Out[7]:
count    372759.000000
mean          6.594961
std           3.431795
min           1.000000
25%           4.000000
50%           7.000000
75%          10.000000
max          12.000000
Name: Month, dtype: float64
In [15]:
# lets change to taransaction amoount to integer
boulder_df['transaction_amount'] = boulder_df['transaction_amount'].astype(int)
boulder_df['transaction_amount'].info()
<class 'pandas.core.series.Series'>
Int64Index: 372759 entries, 0 to 381559
Series name: transaction_amount
Non-Null Count   Dtype
--------------   -----
372759 non-null  int32
dtypes: int32(1)
memory usage: 4.3 MB
In [37]:
boulder_df.head(5)
Out[37]:
vendor_name transaction_date transaction_amount organization_code fund fund_desc department department_desc object_code account_description ObjectId Date Month Month Name
0 PEASE, JUDITH L 2024/01/16 246 18550010 1100 General 185 Finance 670395 Misc Purchased Services 1 2024-01-16 1 Jan
1 PUSH PEDAL PULL, INC 2024/01/15 470 51523032 2300 Recreation Activity 515 Parks and Recreation 670395 Misc Purchased Services 2 2024-01-15 1 Jan
2 ECOARTS CONNECTIONS INC 2024/01/16 700 95111004 1100 General 951 Fundwide / Citywide 670395 Misc Purchased Services 3 2024-01-16 1 Jan
3 XCEL ENERGY 2024/01/16 12 61015020 6100 Water Utility 610 Public Works - Utilities 680210 Gas & Electric 4 2024-01-16 1 Jan
4 PUSH PEDAL PULL, INC 2024/01/15 691 51523031 2300 Recreation Activity 515 Parks and Recreation 630250 Materials: Equipment 5 2024-01-15 1 Jan

1. What was the highest month that incurred the most expense? How much was spent?¶

In [16]:
# Load in some packages
import calendar
import warnings
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations
from collections import Counter

warnings.filterwarnings("ignore")

# 1. What was the expensive month How much was expensed for  that month?

# Replace NaN or inf values in the 'Month' column with a default value (e.g., 0)
boulder_df['Month'] = boulder_df['Month'].fillna(0).astype(int)

# Convert month numbers to abbreviated month names
boulder_df['Month Name'] = boulder_df['Month'].apply(lambda x: calendar.month_abbr[x])

# Group by month and calculate total sales for each month
expense_by_month = boulder_df.groupby('Month Name').sum()['transaction_amount']

# Find the highest month for expense 
highest_month = expense_by_month.idxmax()
expense_for_best_month = expense_by_month.max()

print(f"The Highest month for highest expense was {highest_month} with expense of ${expense_for_best_month:,.2f}")
The Highest month for highest expense was Nov with expense of $259,930,331.00

2. Which Top 5 Vendors had the highest payout¶

In [38]:
# Plot the highest expense  for each vendor

expense_by_vendor = boulder_df.groupby('vendor_name').sum()['transaction_amount'] 

# Sort the values in descending order and select the top five
top_five_expense_by_vendor= expense_by_vendor.sort_values(ascending=False).head(5)

# Display the result
print(top_five_expense_by_vendor)

# Plot the top five County
top_five_expense_by_vendor.plot(kind='bar', color='brown', figsize=(10, 6))
plt.title('Top Five Vendors with the highest expense account')
plt.xlabel('Vendor Name')
plt.ylabel('Expense')
plt.show()
vendor_name
US BANK                                         242598382
FIDELITY NATIONAL TITLE COMPANY                  93114355
LAND TITLE GUARANTEE COMPANY                     53115030
NORTHERN COLORADO WATER CONSERVANCY DISTRICT     49781435
XCEL ENERGY                                      40165691
Name: transaction_amount, dtype: int32

3. Which top 5 Costs were the highest¶

In [39]:
# Plot the highest expense  

expense_by_cost_description = boulder_df.groupby('fund_desc').sum()['transaction_amount'] 

# Sort the values in descending order and select the top five
top_five_expense_by_cost_description = expense_by_cost_description.sort_values(ascending=False).head(5)

# Display the result
print(top_five_expense_by_cost_description)

# Plot the top five County
top_five_expense_by_cost_description.plot(kind='bar', color='yellow', figsize=(10, 6))
plt.title('Top Five Cost with the highest expense account')
plt.xlabel('fund_description')
plt.ylabel('Expense')
plt.show()
fund_desc
General               320767316
Water Utility         310597008
Transportation        222160900
Wastewater Utility    160834236
Open Space            134172438
Name: transaction_amount, dtype: int32

4 Which top 5 Departments had the highest bills¶

In [40]:
# Plot the highest expense  for each department

expense_by_department = boulder_df.groupby('department_desc').sum()['transaction_amount'] 

# Sort the values in descending order and select the top five
top_five_expense_by_department = expense_by_department.sort_values(ascending=False).head(5)

# Display the result
print(top_five_expense_by_department)

# Plot the top five County
top_five_expense_by_department.plot(kind='bar', color='purple', figsize=(10, 6))
plt.title('Top Five department with the highest expense account')
plt.xlabel('fund_desc​ription')
plt.ylabel('Expense')
plt.show()
department_desc
Public Works - Utilities         563521652
Public Works - Transportation    242354042
Fundwide / Citywide              228828532
Facilities & Fleet               172466157
Open Space & Mountain Parks      138763898
Name: transaction_amount, dtype: int32

5 Which top 5 Cost Centers had the highest bills¶

In [25]:
# Plot the highest expense  for each Centers

expense_by_cost_centers = boulder_df.groupby('account_description').sum()['transaction_amount'] 

# Sort the values in descending order and select the top five
top_five_expense_by_cost_centers = expense_by_cost_centers.sort_values(ascending=False).head(5)

# Display the result
print(top_five_expense_by_cost_centers)

# Plot the top five County
top_five_expense_by_department.plot(kind='bar', color='green', figsize=(10, 6))
plt.title('Top Five Centers with the highest expense account')
plt.xlabel('fund_description')
plt.ylabel('Expense')
plt.show()
account_description
Capital: Utility Infrastrcture    220308868
Community Funding/Grants          201435391
Capital: Infrastructure           197742579
Misc Purchased Services           177944536
Capital: Real Estate Acq          110795689
Name: transaction_amount, dtype: int32

6 Which vendor is commonly used¶

In [26]:
boulder_df['vendor_name'].value_counts()
Out[26]:
PCARD ONETIME VENDOR             44295
MCGUCKIN HARDWARE                12695
XCEL ENERGY                      10719
HOME DEPOT USA INC                7257
CENTURYLINK                       4484
                                 ...  
LEMUS, SONNY                         1
PADILLA, JOCELYN MARIE BUCHER        1
PADILLA, ISABEL G BUCHER             1
STADSKLEV, ERIKA                     1
LARRY RODGERS DESIGN GROUP           1
Name: vendor_name, Length: 8066, dtype: int64
In [29]:
expense_by_vendor = boulder_df['vendor_name'].value_counts()

# Sort the values in descending order and select the top five
top_five_expense_by_vendor = expense_by_vendor.sort_values(ascending=False).head(5)

# Display the result
print(top_five_expense_by_vendor )

# Plot the top five County
top_five_expense_by_vendor .plot(kind='bar', color='green', figsize=(10, 6))
plt.title('Top Five Vendors Commonly Used')
plt.xlabel('fund_description')
plt.ylabel('Expense')
plt.show()
PCARD ONETIME VENDOR    44295
MCGUCKIN HARDWARE       12695
XCEL ENERGY             10719
HOME DEPOT USA INC       7257
CENTURYLINK              4484
Name: vendor_name, dtype: int64

7 Provide a payable report that has a higher expense amount more than the average¶

In [31]:
boulder_df['transaction_amount'].mean()

rounded_mean = round(boulder_df['transaction_amount'].mean())
print(rounded_mean)
5431
In [32]:
boulder_df[boulder_df.where(boulder_df['transaction_amount']>5431,other=0).all(1)]
Out[32]:
vendor_name transaction_date transaction_amount organization_code fund fund_desc department department_desc object_code account_description ObjectId Date Month Month Name
9 IMPACT CHARITABLE 2024/01/15 6000 95111004 1100 General 951 Fundwide / Citywide 670395 Misc Purchased Services 10 2024-01-15 1 Jan
54 CITY OF LONGMONT 2024/01/12 25645 40515060 2910 Comm Development Block Grant 405 Housing 670184 Community Funding/Grants 55 2024-01-12 1 Jan
62 BROWN AND CALDWELL 2024/01/12 40303 61090010 6100 Water Utility 610 Public Works - Utilities 810730 Capital: Utility Infrastrcture 63 2024-01-12 1 Jan
64 MOUNTAIN VIEW INTEGRATION LLC 2024/01/12 11249 31510040 1100 General 315 Facilities & Fleet 660878 R&M: Software Maint Agreements 65 2024-01-12 1 Jan
78 TIERRA ROJO CORPORATION 2024/01/12 6065 14524430 2400 Climate Tax Fund 145 Climate Initiatives 670184 Community Funding/Grants 79 2024-01-12 1 Jan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
381485 OUTPUT SERVICES INC 2014/12/20 9500 18520040 1100 General 185 Finance 620118 Postage and Express Mail 381486 2014-12-20 12 Dec
381490 NORTHERN COLORADO WATER CONSERVANCY DISTRICT 2014/12/19 500000 61090010 6100 Water Utility 610 Public Works - Utilities 670078 Water & Ditch Assmts: Other 381491 2014-12-19 12 Dec
381500 WSP USA ENVIRONMENT & INFRASTRUCTURE INC 2014/12/18 9976 61010010 6100 Water Utility 610 Public Works - Utilities 650720 Engineering Consultants 381501 2014-12-18 12 Dec
381519 INTEGRATING TECHNOLOGY & STANDARDS INC 2014/01/29 132961 61090010 6100 Water Utility 610 Public Works - Utilities 810730 Capital: Utility Infrastrcture 381520 2014-01-29 1 Jan
381559 LARRY RODGERS DESIGN GROUP 2005/05/26 10000 51595030 3300 Permanent Parks & Recreation 515 Parks and Recreation 650710 Architectural Consultants 381560 2005-05-26 5 May

32428 rows × 14 columns

8 Provide expense report on vendor US BANK above the average¶

In [33]:
f1 = boulder_df['transaction_amount']>5431
f2 = boulder_df['vendor_name']=="PCARD ONETIME VENDOR"
boulder_df[boulder_df.where(f1& f2,other=0).all(1)]
Out[33]:
vendor_name transaction_date transaction_amount organization_code fund fund_desc department department_desc object_code account_description ObjectId Date Month Month Name
2625 PCARD ONETIME VENDOR 2023/12/26 6386 25510010 1100 General 255 Police 710500 Business Travel 2626 2023-12-26 12 Dec
2628 PCARD ONETIME VENDOR 2023/12/26 7683 55520220 2500 Open Space 555 Open Space & Mountain Parks 630265 Materials: Tools 2629 2023-12-26 12 Dec
2703 PCARD ONETIME VENDOR 2023/12/26 8763 25510040 1100 General 255 Police 630250 Materials: Equipment 2704 2023-12-26 12 Dec
2707 PCARD ONETIME VENDOR 2023/12/26 11105 61040060 6200 Wastewater Utility 610 Public Works - Utilities 660866 R&M: Mach & Equip 2708 2023-12-26 12 Dec
2709 PCARD ONETIME VENDOR 2023/12/26 22335 19015050 7300 Computer Replacement Fund 190 Information Technology 690325 Cellular Phone Service 2710 2023-12-26 12 Dec
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
360506 PCARD ONETIME VENDOR 2015/06/25 6198 31065010 1100 General 310 Public Works - Transportation 630699 Materials: Miscellaneous 360507 2015-06-25 6 Jun
360507 PCARD ONETIME VENDOR 2015/06/25 24588 41005010 1100 General 410 Human Services 630275 Materials: Furniture 360508 2015-06-25 6 Jun
368776 PCARD ONETIME VENDOR 2015/04/25 8250 61060080 6300 Stormwater & Flood Mgt Utility 610 Public Works - Utilities 670395 Misc Purchased Services 368777 2015-04-25 4 Apr
368781 PCARD ONETIME VENDOR 2015/04/25 35449 18530030 7120 Workers Compensation Ins 185 Finance 630699 Materials: Miscellaneous 368782 2015-04-25 4 Apr
372484 PCARD ONETIME VENDOR 2015/03/25 5500 25515100 1100 General 255 Police 630245 Materials: Computer Software 372485 2015-03-25 3 Mar

334 rows × 14 columns

9 Provide top 3 Department with the highest bills¶

In [34]:
boulder_df.nlargest(3,'transaction_amount')
Out[34]:
vendor_name transaction_date transaction_amount organization_code fund fund_desc department department_desc object_code account_description ObjectId Date Month Month Name
337818 FIDELITY NATIONAL TITLE COMPANY 2015/12/04 36827319 95101671 6700 BMPA Debt Service 951 Fundwide / Citywide 810710 Capital: Real Estate Acq 337819 2015-12-04 12 Dec
222630 NORTHERN COLORADO WATER CONSERVANCY DISTRICT 2018/04/25 32677237 61090010 6100 Water Utility 610 Public Works - Utilities 810730 Capital: Utility Infrastrcture 222631 2018-04-25 4 Apr
154948 FIDELITY NATIONAL TITLE COMPANY 2019/11/06 9500000 40515040 2140 Affordable Housing 405 Housing 670184 Community Funding/Grants 154949 2019-11-06 11 Nov

10 Provide bottom 3 Department with the lowest bills¶

In [35]:
boulder_df.nsmallest(3,'transaction_amount')
Out[35]:
vendor_name transaction_date transaction_amount organization_code fund fund_desc department department_desc object_code account_description ObjectId Date Month Month Name
259495 PCARD ONETIME VENDOR 2017/07/25 -812297 18505005 1100 General 185 Finance 670395 Misc Purchased Services 259496 2017-07-25 7 Jul
267358 PCARD ONETIME VENDOR 2017/05/25 -612064 18505005 1100 General 185 Finance 670395 Misc Purchased Services 267359 2017-05-25 5 May
255394 PCARD ONETIME VENDOR 2017/08/25 -552550 18505005 1100 General 185 Finance 670395 Misc Purchased Services 255395 2017-08-25 8 Aug
In [ ]: